MixTailor: Mixed Gradient Aggregation for Robust Learning Against Tailored Attacks
Implementations of SGD on distributed systems create new vulnerabilities,
which can be identified and misused by one or more adversarial agents.
Recently, it has been shown that well-known Byzantine-resilient gradient
aggregation schemes are indeed vulnerable to informed attackers that can tailor
the attacks (Fang et al., 2020; Xie et al., 2020b). We introduce MixTailor, a
scheme based on randomization of the aggregation strategies that makes it
impossible for the attacker to be fully informed. Deterministic schemes can be
integrated into MixTailor on the fly without introducing any additional
hyperparameters. Randomization decreases the capability of a powerful adversary
to tailor its attacks, while the resulting randomized aggregation scheme is
still competitive in terms of performance. For both iid and non-iid settings,
we establish almost sure convergence guarantees that are both stronger and more
general than those available in the literature. Our empirical studies across
various datasets, attacks, and settings, validate our hypothesis and show that
MixTailor successfully defends when well-known Byzantine-tolerant schemes fail.
Comment: To appear at the Transactions on Machine Learning Research (TMLR).
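
A minimal sketch of the core idea follows: the aggregation rule is drawn at random each step, so an informed attacker cannot tailor its gradients to a fixed rule. The pool of aggregators and the uniform sampling shown here are illustrative placeholders, not the exact configuration studied in the paper.

import numpy as np

def coordinate_median(grads):
    # grads: (num_workers, dim) array of per-worker gradients
    return np.median(grads, axis=0)

def trimmed_mean(grads, trim=1):
    # Drop the `trim` smallest and largest values per coordinate, then average.
    sorted_grads = np.sort(grads, axis=0)
    return sorted_grads[trim:len(grads) - trim].mean(axis=0)

def mean(grads):
    return grads.mean(axis=0)

# Hypothetical pool of deterministic aggregators; new rules can be added on the fly.
AGGREGATORS = [mean, coordinate_median, trimmed_mean]

def randomized_aggregation_step(params, worker_grads, lr=0.1, rng=None):
    # Sample the aggregation rule at random each iteration, so an attacker
    # cannot know which rule its tailored gradients will face.
    rng = rng or np.random.default_rng()
    aggregate = AGGREGATORS[rng.integers(len(AGGREGATORS))]
    return params - lr * aggregate(np.asarray(worker_grads))
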
APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations
Recent advances in learning aligned multimodal representations have been
primarily driven by training large neural networks on massive, noisy
paired-modality datasets. In this work, we ask whether it is possible to
achieve similar results with substantially less training time and data. We
achieve this by taking advantage of existing pretrained unimodal encoders and
careful curation of alignment data relevant to the downstream task of interest.
We study a natural approach to aligning existing encoders via small auxiliary
functions, and we find that this method is competitive with (or outperforms)
state of the art in many settings while being less prone to overfitting, less
costly to train, and more robust to distribution shift. With a properly chosen
alignment distribution, our method surpasses prior state of the art for
ImageNet zero-shot classification on public data while using two orders of
magnitude less time and data, and training 77% fewer parameters.
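
The sketch below illustrates the general recipe: keep two pretrained unimodal encoders frozen and train only small auxiliary heads with a contrastive alignment loss. The head architecture, embedding size, and temperature are assumptions for illustration, not the exact APE setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    # Small trainable function mapping frozen encoder features into a shared space.
    def __init__(self, in_dim, out_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, features):
        return F.normalize(self.proj(features), dim=-1)

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE loss over a batch of paired image/text embeddings.
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Only the two heads receive gradients; the pretrained image and text encoders
# stay frozen, which keeps training cheap and reduces overfitting.
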
Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement
We propose Dataset Reinforcement, a strategy to improve a dataset once such
that the accuracy of any model architecture trained on the reinforced dataset
is improved at no additional training cost for users. We propose a Dataset
Reinforcement strategy based on data augmentation and knowledge distillation.
Our generic strategy is designed based on extensive analysis across CNN- and
transformer-based models and a large-scale study of distillation with
state-of-the-art models and various data augmentations. We create a reinforced
version of the ImageNet training dataset, called ImageNet+, as well as
reinforced datasets CIFAR-100+, Flowers-102+, and Food-101+. Models trained
with ImageNet+ are more accurate, robust, and calibrated, and transfer well to
downstream tasks (e.g., segmentation and detection). As an example, the
accuracy of ResNet-50 improves by 1.7% on the ImageNet validation set, 3.5% on
ImageNetV2, and 10.0% on ImageNet-R. Expected Calibration Error (ECE) on the
ImageNet validation set is also reduced by 9.9%. Using this backbone with
Mask-RCNN for object detection on MS-COCO, the mean average precision improves
by 0.8%. We reach similar gains for MobileNets, ViTs, and Swin-Transformers.
For MobileNetV3 and Swin-Tiny, we observe significant improvements of up to
20% in robustness on ImageNet-R/A/C. Models pretrained on ImageNet+
and fine-tuned on CIFAR-100+, Flowers-102+, and Food-101+, reach up to 3.4%
improved accuracy. The code, datasets, and pretrained models are available at
https://github.com/apple/ml-dr.
Comment: Accepted at the International Conference on Computer Vision (ICCV) 2023.
Camera-ready version with new Tables 9 and 1
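
A minimal sketch of the reinforcement step is given below: run a strong teacher once on augmented samples and store the reproducible augmentation parameters together with sparse soft labels, so any later student can train against them without re-running the teacher. The augmentation interface, number of views, and top-k size are illustrative assumptions, not the released ImageNet+ recipe.

import torch
import torch.nn.functional as F

@torch.no_grad()
def reinforce_sample(teacher, image, augment, num_views=4, topk=10):
    # `augment` is a hypothetical callable returning an augmented image and the
    # parameters needed to reproduce that augmentation during student training.
    records = []
    for _ in range(num_views):
        aug_image, aug_params = augment(image)
        probs = F.softmax(teacher(aug_image.unsqueeze(0)), dim=-1).squeeze(0)
        values, indices = probs.topk(topk)  # sparse soft label, cheap to store
        records.append({
            "aug_params": aug_params,
            "label_idx": indices.cpu(),
            "label_val": values.cpu(),
        })
    return records

# At student-training time, the stored aug_params are re-applied to the raw image
# and the student is trained against the stored soft labels (e.g., with a KL loss),
# so the teacher never needs to be evaluated again.
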
TiC-CLIP: Continual Training of CLIP Models
Keeping large foundation models up to date on latest data is inherently
expensive. To avoid the prohibitive costs of constantly retraining, it is
imperative to continually train these models. This problem is exacerbated by
the lack of any large-scale continual learning benchmarks or baselines. We
introduce the first set of web-scale Time-Continual (TiC) benchmarks for
training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps with
over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first
use our benchmarks to curate various dynamic evaluations to measure temporal
robustness of existing models. We show OpenAI's CLIP (trained on data up to
2020) loses zero-shot accuracy on our curated retrieval task from
2021--2022 compared with more recently trained models in the OpenCLIP repository.
We then study how to efficiently train models on time-continuous data. We
demonstrate that a simple rehearsal-based approach that continues training from
the last checkpoint and replays old data reduces compute when compared to the
standard practice of retraining from scratch.
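
A minimal sketch of this rehearsal baseline, warm-starting from the latest checkpoint and mixing replayed old pairs with the new time step's data, follows; the replay ratio and data interfaces are illustrative assumptions.

import random

def build_rehearsal_pool(new_pairs, old_pairs, replay_ratio=0.5, seed=0):
    # Combine every new image-text pair with a random subset of previously seen
    # pairs, sized as a fraction of the new data.
    rng = random.Random(seed)
    n_replay = min(int(replay_ratio * len(new_pairs)), len(old_pairs))
    return list(new_pairs) + rng.sample(list(old_pairs), n_replay)

# Continual loop sketch (load_checkpoint / train are placeholders):
#   model = load_checkpoint(timestep - 1)            # warm start, not from scratch
#   pool = build_rehearsal_pool(data[timestep], all_previous_data)
#   train(model, pool)                               # far cheaper than full retraining
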
NUQSGD: Provably communication-efficient data-parallel SGD via nonuniform quantization
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees; however, for practical purposes, the authors proposed a heuristic variant, which we call QSGDinf, that demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has stronger theoretical guarantees than QSGD while matching and exceeding the empirical performance of the QSGDinf heuristic and of other compression methods.
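
The sketch below illustrates the general idea of nonuniform gradient quantization: normalize by the gradient norm, stochastically round each magnitude to one of a small set of exponentially spaced levels so the estimate stays unbiased, and keep only the sign, norm, and level index. The level set and the reconstruction shown here are a simplified illustration, not the exact NUQSGD encoding.

import numpy as np

def nonuniform_quantize(v, s=3, rng=None):
    # Quantize |v_i| / ||v|| to exponentially spaced levels 0, 2^-s, ..., 1/2, 1
    # with unbiased stochastic rounding between neighboring levels.
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    levels = np.array([0.0] + [2.0 ** (-i) for i in range(s, -1, -1)])
    r = np.abs(v) / norm
    idx = np.searchsorted(levels, r, side="right") - 1       # lower neighboring level
    lo = levels[idx]
    hi = levels[np.minimum(idx + 1, len(levels) - 1)]
    gap = np.where(hi > lo, hi - lo, 1.0)
    p_up = (r - lo) / gap                                     # keeps E[quantized] = r
    q = np.where(rng.random(r.shape) < p_up, hi, lo)
    # A real implementation would transmit sign(v), the norm, and the level indices;
    # here we return the dequantized vector a receiver would reconstruct.
    return np.sign(v) * norm * q
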